Abstract. A popular idea in Computer Assisted Language Learning (CALL) is to use 
multimodal annotated texts, with annotations typically including embedded audio 
and translations, to support L2 learning through reading. An important question is 
how to create the audio, which can be done either through human recording or by 
a Text-To-Speech (TTS) synthesis engine. We may reasonably expect TTS to be 
quicker and easier, but human recordings to be of higher quality. Here, we report a study 
using the open-source LARA platform and ten languages. Samples of LARA audio 
totaling about three and a half minutes were provided for each language in both 
human and TTS form; subjects used a web form to compare different versions of 
the same item and rate the voices as a whole. Although human voice was more often 
preferred, TTS achieved higher ratings in some languages and was close in others. 
1. Introduction 


An increasingly popular idea over the last decade is to help L2 learners improve 
their reading skills in non-native languages by creating annotated multimedia texts 
that contain integrated help, most commonly word translations and/or audio. 
High-profile examples include LingQ¹⁶ and Learning With Texts¹⁷. 


In this study, our focus is the audio, created either by recording human voice or 
through a TTS engine. Using TTS is faster, but despite ongoing improvements in 
TTS technology, human-recorded audio is still of higher voice quality. It is less 
clear how large the difference is, or how important it is in practice when TTS is 
used in L2 teaching. Our study addresses these questions. 


2. Method 


The experiments were performed using LARA¹⁸ (Akhlaghi et al., 2020), a 
learning-by-reading platform under development by an international open-source 
consortium since 2018. So far, most LARA texts have used human-recorded audio, 
though the Irish LARA group has consistently used TTS. In our study, we selected 
existing LARA texts in Danish, English, Farsi, French, Icelandic, Irish, Italian, 
Mandarin, Spanish, and Swedish, creating a version using the other method so that 
it was available in both human and TTS voice. For each language, a single human 
voice was used and TTS audio was created using the best TTS engine available 
to us for that language: ABAIR¹⁹ for Irish, Google TTS²⁰ for Mandarin, Nuance 
Vocalizer²¹ for Farsi, and ReadSpeaker²² for the other languages. 


We randomly selected contiguous passages from the texts so that the total audio 
for each language was about three and a half minutes; for some languages, 
we also included individual words. The material was presented on an openly 
available anonymous web form consisting of three portions: demographic data; 
item-by-item comparison of the audio; and overall impressions of the two 
voices. In the item-by-item comparison, subjects chose between ‘both acceptable 
and roughly equal’, ‘both acceptable but one clearly better’, ‘one acceptable, 
one not acceptable’, and ‘neither acceptable’. In the overall impressions part, 
subjects gave Likert scale scores for quality of individual words, quality of whole 
sentences, speed, naturalness, pleasantness, suitability for teaching, and suitability 
for imitating, and could add a freeform response. Full details are posted in the 
supplementary materials. 


16. https://www.lingq.com/ 
17. https://sourceforge.net/projects/lwt/ 
18. https://www.unige.ch/callector/lara/ 
19. http://www.abair.ie/ 
20. https://cloud.google.com/text-to-speech 
21. https://www.nuance.com/en-au/omni-channel-customer-engagement/voice-and-ivr/text-to-speech/vocalizer.html 
22. https://www.readspeaker.com/ 
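
The passage selection described in this section could be sketched roughly as follows. This is only an illustration, not the procedure actually used in the study: the function name `pick_passages`, the fixed passage length of five segments, and the per-segment durations are all assumptions. The only figure taken from the text is the target of about three and a half minutes (210 seconds).

```python
import random


def pick_passages(durations, target_seconds=210.0, passage_len=5, seed=None):
    """Pick random contiguous passages until the target audio length is reached.

    durations: per-segment audio lengths in seconds, in text order (assumed input).
    Returns (passages, total); each passage is a list of consecutive segment indices.
    """
    rng = random.Random(seed)
    # Cut the text into non-overlapping runs of `passage_len` consecutive segments.
    starts = list(range(0, len(durations), passage_len))
    rng.shuffle(starts)  # visit candidate passages in random order

    passages, total = [], 0.0
    for start in starts:
        if total >= target_seconds:
            break
        passage = list(range(start, min(start + passage_len, len(durations))))
        passages.append(passage)
        total += sum(durations[i] for i in passage)
    return sorted(passages), total


# Hypothetical text: 200 segments of 4 seconds each.
passages, total = pick_passages([4.0] * 200, seed=1)
```

Because whole passages are added until the target is met, the total can overshoot the target by up to one passage, which is consistent with the "about three and a half minutes" wording.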


3. Results 


Responses were logged for 130 subjects and collated using a script. There were 
large differences between languages, between responses for sentences and words, 
and between native and non-native judgments. Table 1, Table 2, Table 3, and 
Table 4 show results for the portions of the data we considered most informative. 
Full details are posted in the supplementary materials. 
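
As an illustration of the kind of collation involved, the sketch below turns item-by-item responses into the percentage rows reported in Table 2 and Table 4. The response codes and the mapping from codes to rows are assumptions made for this example; the paper does not specify how the script derived each row.

```python
from collections import Counter

# Hypothetical response codes, one per (rater, item) judgment.
HUMAN_OK = {"equal", "human_better", "tts_better", "only_human_ok"}
TTS_OK = {"equal", "human_better", "tts_better", "only_tts_ok"}
HUMAN_WINS = {"human_better", "only_human_ok"}
TTS_WINS = {"tts_better", "only_tts_ok"}
SAME = {"equal", "neither"}


def summarise(responses):
    """Return percentage rows for a list of response codes (assumed mapping)."""
    if not responses:
        raise ValueError("no responses to summarise")
    counts = Counter(responses)
    n = len(responses)

    def pct(codes):
        return round(100.0 * sum(counts[c] for c in codes) / n, 1)

    return {
        "human_acceptable": pct(HUMAN_OK),
        "tts_acceptable": pct(TTS_OK),
        "human_better": pct(HUMAN_WINS),
        "tts_better": pct(TTS_WINS),
        "same": pct(SAME),
    }


# Ten made-up judgments for one language.
responses = (["equal"] * 3 + ["human_better"] * 4
             + ["only_human_ok"] * 2 + ["neither"])
summary = summarise(responses)
```

Under this mapping the last three rows partition the responses, so they sum to 100 up to rounding, as the Danish figures in Table 2 do (81.6 + 3.1 + 15.3 = 100.0).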


Table 1. Overall impressions of voices, five-point Likert scale; ratings from 
native/near-native speakers only. In each cell, human rating above 
and TTS rating below. Yellow=TTS equal or better than human, 
orange=TTS within 0.5 of human 


Language    DA 
(#raters) 

Sentences 

Speed       4.14 
            2.57 
Natural     4.29 
            1.86 
Pleasant    4.14 
            2.43 
Teaching    4.43 
            2.43 
Imitating   4.43 
            2.14 
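
The highlighting convention stated in the Table 1 caption can be expressed as a small rule. The sketch below is illustrative only: the function name `highlight` and the `margin` parameter are assumptions, and reading "within 10%" in the percentage tables as a difference of ten percentage points is also an assumption.

```python
def highlight(human, tts, margin=0.5):
    """Classify one table cell from its human and TTS ratings.

    "yellow": TTS equal to or better than human.
    "orange": TTS lower than human, but by no more than `margin`.
    "none":   TTS lower than human by more than `margin`.
    """
    if tts >= human:
        return "yellow"
    if human - tts <= margin:
        return "orange"
    return "none"


# Danish "Speed" cell from Table 1: human 4.14, TTS 2.57.
da_speed = highlight(4.14, 2.57)

# Danish acceptability from the item-by-item comparison, with the margin
# set to 10 (assumed to mean percentage points).
da_acceptable = highlight(98.0, 41.8, margin=10.0)
```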


Table 2. Item-by-item comparison averages; percentage ratings from native/ 
near-native speakers only, sentences only. Yellow=TTS equal or 
better than human, orange=TTS within 10% of human 


Language DA EN FA FR IS IE IT ZH SP SW 
(#raters) (7) ? ? ? ? ? ? ? ? ? 
(#items) (14) (8) (20) (22) (15) (39) (23) (17) (16) (14) 

Human acceptable 98.0 100.0 98.3 94.4 100.0 
TTS acceptable 41.8 88.5 66.5 81.1 40.6 
Human better 81.6 43.3 70.9 53.3 100.0 40.5 
TTS better 3.1 20.2 6.1 14.4 ? ? 0.0 17.1 
(same) (15.3) (36.5) (23.0) (66.7) (32.2) (45.3) (42.9) (50.0) (0.0) (52.4) 

Table 3. Overall impressions of voices; teachers/trainee teachers only 
(conventions as in Table 1) 

Language FA IS IE IT ZH SP SW 
(#raters) (13) (9) ? ? 

Words 4.69 4.33 5.0 5.0 
3.15 3.22 2.0 3.0 

Sentences 4.85 4.22 5.0 5.0 
2.91 2.92 3.22 2.0 3.0 

Speed 4.14 4.85 4.22 ? 
? 3.15 3.67 4.0 

Natural 4.29 5.0 4.67 ? 5.0 5.0 
1.86 2.46 2.78 3.0 1.0 3.0 

Pleasant 4.14 4.77 4.44 5.0 5.0 ? 
2.43 2.85 3.22 3.5 2.0 

Teaching 4.43 4.92 4.56 5.0 5.0 4.0 
2.43 2.31 3.22 3.5 1.0 3.0 

Imitating 4.43 4.69 4.11 5.0 5.0 4.0 
2.14 2.38 2.67 3.5 1.0 2.0 

Table 4. Item-by-item comparison averages; teachers only, sentences only 
(conventions as in Table 2) 

Language DA EN FA FR IS IE IT ZH SP SW 
(#raters) ? (16) (13) (14) ? (25) (12) (12) ? ? 
(#items) (14) (8) (20) (22) (15) 

Human acceptable 98.0 97.3 94.8 
TTS acceptable 41.8 55.0 81.5 
Human better 81.6 78.1 60.0 33.2 
TTS better 3.1 4.2 15.6 11.8 2.9 0.0 0.0 
(same) (15.3) (35.9) (17.7) (56.8) (24.4) (55.0) (39.1) (50.0) (0.0) (14.3) 
4. Discussion and conclusions 


This is a preliminary study in a rapidly evolving field, mostly using only one text 
per language, with genres ranging from simple children’s stories to literary novels. 
The human voices were a mixture of male and female ranging from experienced 
teachers to a twelve-year-old child, while all but one of the TTS voices were young 
females. 


With the above caveats, human audio was preferred more often than TTS, but 
this was by no means always the case, and the gap was surprisingly small. Some 
TTS engines are better than others: the English, Irish, and Italian speech engines 
used clearly outperform the Danish and Farsi ones. TTS engines did very well on 
pronunciation (high scores in the ‘Words’ rows), but less well on sentence-level 
phenomena such as prosody, coarticulation processes, speed, etc. (lower scores in 
the ‘Sentences’ rows). Teachers rated TTS more highly than native speakers did 
(comparing Table 1 and Table 2 with Table 3 and Table 4). Non-native speakers and 
non-teachers rated TTS even more highly (see supplementary materials). 


We are planning an extended study using a larger sample of texts and voices. 


5. Supplementary materials 


Relevant LARA texts, data collection form, and full results: 
https://www.issco.unige.ch/en/research/projects/callector/EUROCALL_2021_data.html 